2020 was a year that is likely to be remembered for one thing above all, the COVID-19 pandemic which has claimed some 3.5 million lives to date. In response to this threat, the United States Center for Disease Control as well as various international bodies and state authorities issued recommendations that citizens "shelter-in-place" and later "stay at home". Faced with the prospect of cabin fever, numerous hobbies experienced a surge in popularity, of which one was videogaming. Steam, a popular seller of PC games, reported an an all-time high concurrent user count in excess of 20 million users and Nintendo reported a continuation of its streak of selling over 2 million units monthly for its line of Nintendo Switch consoles. This proved a convenient backdrop for manufacturers Microsoft, Sony, AMD, and NVIDIA, who all announced new products including Xbox Series X and S and Playstation 5 consoles as well as a suite of new Graphics Processing Units (GPUs) for PC all scheduled for release from Fall of 2020 through to Spring of 2021.
Anticipation for these new products was high, and it was driven higher by bold promises from company spokespeople. NVIDIA promised gaming at resolutions up to 8k supported by an almost doubling of performance of the flagship RTX 3080 card over its predecessor, the RTX 2080. Microsoft and Sony each promised 4k gaming at up to 120 hz for \$500 or less. However, shortly after the launch of the first few products, it became apparent that all of these products were sold out at retailers and stories began to emerge of mass bot purchases by speculators looking to flip them at a heft markup on second-hand sale sites, over \\$1000 in the case of the Xbox and Playstation and even higher for some GPUs. One notable story in the United Kingdom claimed that a single scalper group had managed to obtain some 40 Playstation 5 consoles. The drastic inflation over MSRP incensed a vocal group online. The ongoing pandemic had left many people jobless despite ongoing recovery efforts, and outside parties making money off of scooping up stock before others could get them added insult to injury. Over the course of the past several months, hardware shortages have persisted, and the manufacturers expect that they will not clear up in the near future.
The purpose of this analysis is to examine the sales over time of a selection of these hardware from second-hand sources to detect patterns in pricing and determine best courses of action for individual consumers. While doing so, I will explain the process in a tutorial like manner to allow for replication.
There are multiple notable websites that are used for purchase and sale of hardware second-hand, including ebay, Facebook marketplace, and Craig's List. While there would be a more complete view provided if multiple of these sites were scraped, the sites do not necessarily follow similar procedures for making data accesible, so a substantial time cost would be incurred. I have instead only selected Ebay, which was chosen for three primary reasons. First, I have been able to confirm that it is possible to view records of sold items and to narrow these results by specific unit identification rather than only through text-matching search. This lowers the risk of accidentally scraping data for secondary products, such as a Playstation controller instead of a Playstation. Secondly, Ebay provides extensive documentation for multiple APIs and a developer program with free registration. Thirdly, Ebay has considerable longevity as a seller and captures a larger share of second-hand sales than the alternatives. However, since not all sales relevant to the discussion can be captured due to this narrowing, the predictive nature of the resulting analysis will be diminished, and its merit should be considered accordingly.
However, after proceeding with further research, it became clear that ebays own API would not work for my purposes. Access to sold listings in ebay is restricted to select developers, rather than broadly available, likely due to ebay offering a competing market research analytic tool for its sellers. Instead, I elected to utilize a scraper API, ProxyCrawl. This requires an account as well as a subscription after 1000 requests. However, this will still allow us to gather a large amount of data before this.
The Covid-19 pandemic directly affected the supply chain for the manufacture and sale of all products in this analysis. This has been cited by each of the manufacturers as a contributing factor to the ongoing shortages. Further, many or all of the products being compared were manufactured in China. On December 31, 2020, the exception to tariffs placed on Chinese goods which had been granted to certain computer hardware items expired. Also, during the period of sale for these products, cryptocurrencies experienced an explosive growth relative to the US dollar. This likely drove up demand for all GPUs due to their extensive use by cryptocurrency miners. The details of the function of cryptocurrency are beyond scope of this analysis, but their value in an additional area leads to an expectation that GPUs may experience greater markup on the second-hand market than consoles.
import pandas as pd
import numpy as np
import json
import proxycrawl
from time import sleep
from os.path import isfile
import matplotlib.pyplot as plt
import re
import ast
from sklearn import linear_model
import warnings ## Note: Only included for export of notebook purposes. Do not replicate in functional code.
warnings.filterwarnings('ignore')
| Library | Use Case | API Documentation url | ||
|---|---|---|---|---|
| Pandas | Data Table | https://pandas.pydata.org/pandas-docs/stable/ | ||
| Numpy | Mathematical Calculations | https://numpy.org/doc/ | ||
| Json | Value extraction from Scraped Data | https://docs.python.org/3/library/json.html | ||
| ProxyCrawl | Web Scraping | https://pypi.org/project/proxycrawl/ | ||
| Time | Delaying library Calls as Courtesy | https://docs.python.org/3/library/time.html | ||
| OS | Checking if Files Exist | https://docs.python.org/3/library/os.html | ||
| Matplotlib | Plotting Backend | https://matplotlib.org/stable/contents.html | ||
| Re | Pattern Matching | https://docs.python.org/3/library/re.html | ||
| AST | Evaluation of strings as Python Code | https://docs.python.org/3/library/ast.html | ||
| Sklearn | Data Modelling and Predictions | https://sklearn.org/documentation.html |
Since the number of calls that can be made to the ProxyCrawl API without incurring additional costs is limited, it is undesireable to repeat scraping upon repetition. Therefore, the scraping calls are only made if the CSV that will be used to store the data does not already exist. The name of the CSV must be changed or the CSV deleted if you desire to run the scraping more than once.
ALREADY_SCRAPED = isfile('ebay-scraping.csv')
print(ALREADY_SCRAPED)
True
URLS = {'xbx':'https://www.ebay.com/sch/i.html?_from=R40&_nkw=xbox+series+x&_sacat=0&Model=Microsoft%2520Xbox%2520Series%2520X&_dcat=139971&LH_Sold=1&LH_Complete=1&rt=nc&LH_PrefLoc=1', 'xbs': 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=xbox+series+s&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&LH_Complete=1&LH_Sold=1&rt=nc&Platform=Microsoft%2520Xbox%2520Series%2520X%257CS&_dcat=139971','ps5_disk':'https://www.ebay.com/sch/i.html?_from=R40&_nkw=playstation+5&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&LH_Complete=1&LH_Sold=1&rt=nc&Model=Sony%2520PlayStation%25205%2520Blu%252DRay%2520Edition&_dcat=139971', 'ps5_dig':'https://www.ebay.com/sch/i.html?_from=R40&_nkw=playstation+5&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&LH_Complete=1&LH_Sold=1&_oaa=1&rt=nc&Model=Sony%2520PlayStation%25205%2520Digital%2520Edition&_dcat=139971', 'rtx3080':'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1311&_nkw=nvidia+rtx+3080&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&_osacat=0&_odkw=xbox+series+s&LH_Complete=1&_dcat=139971&rt=nc&LH_Sold=1&Platform=Microsoft%2520Xbox%2520Series%2520X%257CS', 'rtx3070' : 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=nvidia+rtx+3070&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&_osacat=0&LH_Complete=1&LH_Sold=1', 'rx6800':'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=amd+rx+6800&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&_osacat=0&LH_Complete=1&LH_Sold=1', 'rx6900':'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=amd+rx+6900&_sacat=0&LH_TitleDesc=0&LH_PrefLoc=1&_osacat=0&LH_Complete=1&LH_Sold=1'}
PLATFORM_INFO = {'xbx':{'type':'console', 'base_price': 499.99, 'company':'MSFT'}, 'xbs':{'type':'console','base_price':299.99, 'company':'MSFT'}, 'ps5_disk':{'type':'console', 'base_price':499.99, 'company':'SONY'}, 'ps5_dig':{'type':'console', 'base_price':399.99, 'company':'SONY'}, 'rtx3080':{'type':'GPU', 'base_price':699.00, 'company':'NVDA'}, 'rtx3070': {'type':'GPU', 'base_price': 499.00, 'company':'NVDA'}, 'rx6800':{'type':'GPU', 'base_price':579.99, 'company':'AMD'}, 'rx6900':{'type':'GPU', 'base_price':999.99, 'company':'AMD'}}
PAGES_TO_GET = {'xbx':100, 'xbs':5, 'ps5_disk':100, 'ps5_dig':100, 'rtx3080':50, 'rtx3070':100, 'rx6800':10, 'rx6900':10}
ProxyCrawl is not a free API. The python binding included link to a server backend online, so it requires an account with an access token to make scraping requests. For privacy, my token is stored as a text file (not included) which I must read back into memory.
with open('proxycrawl-token.txt', 'r') as file: ## Read in access taken for ProxyCrawl;
token = file.read().replace('\n', '')
file.close()
The relevant API within ProxyCrawl. is the Crawling API and the 'ebay-serp' scraper. It collecs JSON data on all listings on a particular ebay search page. However, a limitation it possesses is a lack of sell date information, since it was designed for checking items on sale, rather than sold item records. The sale price is still recorded, but since it is impossible to gather the time data which was desired for time series analysis, it is necessary to take a step backwards and reframe the question being investigated. In order to proceed to meaningful analysis, the scope of the project is narrowed to consider comparitative pricing between similar but distinct products, rather than the trends of individual products.
api = proxycrawl.CrawlingAPI({'token' : token})
OPTIONS = {'scraper':'ebay-serp'}
def scrape(url, options, max_pages): ## Create a scrape function that can be invoked for each product, as they each need their
listings = [] ## own url and number of pages. This decreases the total amount of code to write.
url1 = url ## Create a new url so that it can be modified to scrape new pages as the loop continues.
for i in range(0,max_pages):
if i > 0: ## To scrape pacges after the first, must add a pagenumber for the search results.
url1 = url + '_&pgn=' + str(i+1)
response = api.get(url1, options=options)
if response['status_code'] == 200: ## Data is available only if the status code returns 200. Otherwise, must debug
x = json.loads(response['body']) ## Proxy-Crawl's ebay-serp scraper returns search listings as JSON. Load it in and
listings.extend(x['body']['products']) ## add the products to the list
else:
print('Error while crawling data. Response Code: ' + str(response['status_code']) + ', page: ' + str(i+1)) ## Error Code, nonfatal
if i < max_pages - 1:
sleep(1 / 20) ## Don't send requests too frequently, otherwise the proxy stops working.
return listings
def price_transform(item): ## Prices are gathered as nested elements. Must de-nest, will convert later
from_s = item['price']['current']['from']
to_s = item['price']['current']['to']
trend_s = item['price']['trendingPrice']
if trend_s != None: ## From time to time, ebay will report a trending price for items. We'll gather that too.
item['trendingPrice'] = trend_s
else:
item['trendingPrice'] = None
if from_s != None:
if '$' == from_s[0:1]: ## US dollars will included either a dollar sign, or the code USD. Either case
item['sale_price'] = from_s[1:] ## can be used, but we must remove this anyway so that proper float casting can
elif 'USD' == from_s[0:3]: ## take place.
item['sale_price'] = from_s[3:]
else:
item['sale_price'] = None ## If the sale is listed in another currency, it can't be used here at scale.
elif to_s != None:
if '$' == to_s[0:1]:
item['sale_price'] = to_s[1:]
elif 'USD' == to_s[0:3]:
item['sale_price'] = to_s[3:]
else:
item['sale_price'] = None
else:
item['sale_price'] = None
item.pop('price', None)
return item
def add_key_parameters(item, model): ## Add in the individual product information to each item, so that it can be used later
item['model'] = model
item['type'] = PLATFORM_INFO[model]['type']
item['base_price'] = PLATFORM_INFO[model]['base_price']
item['company'] = PLATFORM_INFO[model]['company']
In order to create recoverable data in the event of unexpected code behavior or crashes, I have set up the code to backup the current items after each set is scraped. This allows for partial re-scraping rather than having to begin the process from scratch. I settled on backing up three times due to the transformations I implemented in order to make the data usable.
if not ALREADY_SCRAPED: ## The actual scraping loop. Only used if the scrape hasn't already been performed.
products = []
for x in URLS.keys():
print(x)
items = scrape(URLS[x], OPTIONS, PAGES_TO_GET[x]) ## Get response data for the number of pages for the product
backup1 = pd.DataFrame.from_dict(items) ## Backup the full list of responses to csv
backup1.to_csv('backups/raw_{}.csv'.format(x))
items = list(map(price_transform, items)) ## Begin parsing out price info
items = list(filter(lambda x: x['sale_price'] != None, items)) ## Remove entries without a valid price
backup2 = pd.DataFrame.from_dict(items)
backup2.to_csv('backups/transform_{}.csv'.format(x)) ## Backup data with price info (still a string).
for y in items:
add_key_parameters(y, x) ## Add product info for each entry
backup3 = pd.DataFrame.from_dict(items)
backup3.to_csv('backups/full_{}.csv'.format(x)) ## Backup entries + prices (string) + product info
products.extend(items)
sleep(1/20) ## Delay as courtesy to ProxyCrawl backend
df = pd.DataFrame.from_dict(products)
df.to_csv('ebay-scraping.csv') ## Save all entries to the final file.
Once the data has been scraped, we read it into memory from the data file, and proceed to examination and cleaning.
df = pd.read_csv('ebay-scraping.csv', index_col=0) ## Read in saved file
df.head()
| title | subTitles | bidsCount | hotness | additionalHotness | customerReviews | shippingMessage | image | url | location | time | listingDate | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | NaN | [] | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | https://i.ebayimg.com/thumbs/images/g/ZLoAAOSw... | https://www.ebay.com/itm/224461848780?epid=140... | NaN | {'timeLeft': '', 'timeEnd': ''} | NaN | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | NaN | [] | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | https://i.ebayimg.com/thumbs/images/g/K3UAAOSw... | https://www.ebay.com/itm/303997391845?epid=140... | NaN | {'timeLeft': '', 'timeEnd': ''} | NaN | False | False | NaN | 748.88 | xbx | console | 499.99 | MSFT |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | NaN | [] | {'review': '', 'count': 1087, 'link': 'https:/... | +$11.99 shipping | https://i.ebayimg.com/thumbs/images/g/PhgAAOSw... | https://www.ebay.com/itm/144038432213?epid=140... | NaN | {'timeLeft': '', 'timeEnd': ''} | NaN | False | False | NaN | 690.00 | xbx | console | 499.99 | MSFT |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | NaN | [] | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | https://i.ebayimg.com/thumbs/images/g/-IMAAOSw... | https://www.ebay.com/itm/164859267168?epid=140... | NaN | {'timeLeft': '', 'timeEnd': ''} | NaN | False | False | NaN | 750.00 | xbx | console | 499.99 | MSFT |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | NaN | [] | {'review': '', 'count': 1087, 'link': 'https:/... | +$65.94 shipping | https://i.ebayimg.com/thumbs/images/g/Xf0AAOSw... | https://www.ebay.com/itm/114805916376?epid=140... | from Canada | {'timeLeft': '', 'timeEnd': ''} | NaN | False | False | NaN | 643.78 | xbx | console | 499.99 | MSFT |
Several of the columns included in the data provided by the API are not useful for this analysis due to either an absence of any information (in the case of hotness, additionalHotness, time, and listingDate), a lack of direct relevance for analysis (url and image), or a combination of lack of information, lack of distinction, and lack of clarity in the case of location. Therefore all of these variables will be tossed away before further cleaning efforts are taken.
df = df.drop(columns=['hotness','additionalHotness','image','url','time','listingDate', 'location'])
df.head()
| title | subTitles | bidsCount | customerReviews | shippingMessage | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | NaN | 748.88 | xbx | console | 499.99 | MSFT |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | {'review': '', 'count': 1087, 'link': 'https:/... | +$11.99 shipping | False | False | NaN | 690.00 | xbx | console | 499.99 | MSFT |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | NaN | 750.00 | xbx | console | 499.99 | MSFT |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | {'review': '', 'count': 1087, 'link': 'https:/... | +$65.94 shipping | False | False | NaN | 643.78 | xbx | console | 499.99 | MSFT |
Due to the scraping process, the prices and shipping costs are still stored as strings. In order to work with them, it is necessary to convert them to floats, but this must account for some irregularities in display. These prices may have commas or spaces in them, so either must be removed.
sale_price = [] ## Convert prices from strings to floats with necessary cleanging steps
for t in df.itertuples():
s = t.sale_price
s = s.replace(",", "") ## Need to remove commas
s = s.replace(" ","") ## Need to remove spaces
x = float(s)
sale_price.append(x)
df['sale_price'] = sale_price ## Overwrite the columns they were stored in
df.head()
| title | subTitles | bidsCount | customerReviews | shippingMessage | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | NaN | 748.88 | xbx | console | 499.99 | MSFT |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | {'review': '', 'count': 1087, 'link': 'https:/... | +$11.99 shipping | False | False | NaN | 690.00 | xbx | console | 499.99 | MSFT |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | NaN | 750.00 | xbx | console | 499.99 | MSFT |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | {'review': '', 'count': 1087, 'link': 'https:/... | +$65.94 shipping | False | False | NaN | 643.78 | xbx | console | 499.99 | MSFT |
Trending Price is only listed infrequently for products. I used forward filling of these values under two assumptions: each product should have a singular trending price that it shares with all other similar products, and each product should have the trending price attached to the first element. As it later turned out, the second assumption was incorrect altogether, as no trending price was given at all for certain products. The more appropriate decision would actually have been to toss the column, since there wasn't a "correct" filling for some of the products of interest. As a result it does not factor into the analysis greatly.
df['trendingPrice'] = df['trendingPrice'].fillna(method = 'ffill')
df.head()
| title | subTitles | bidsCount | customerReviews | shippingMessage | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | $739.95 | 748.88 | xbx | console | 499.99 | MSFT |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | {'review': '', 'count': 1087, 'link': 'https:/... | +$11.99 shipping | False | False | $739.95 | 690.00 | xbx | console | 499.99 | MSFT |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | Free shipping | False | False | $739.95 | 750.00 | xbx | console | 499.99 | MSFT |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | {'review': '', 'count': 1087, 'link': 'https:/... | +$65.94 shipping | False | False | $739.95 | 643.78 | xbx | console | 499.99 | MSFT |
The cost of shipping any of these products is stored within a string, so I needed to figure out a way of extracting them. I chose to use pattern matching through regular expressions. The catch was if there was no match to the pattern, I assume that the shipping was "free shipping," hence used zero as the value.
shipping = []
for l in df.itertuples():
if '$' in str(l.shippingMessage): ## Extract the prices (USD) listed in the shipping message.
m = re.search(r'[0-9]+.[0-9]*', l.shippingMessage)
s = m.group(0)
x = float(s)
shipping.append(x)
else:
shipping.append(0)
df['shipping'] = shipping
df.drop(inplace=True, columns=['shippingMessage'])
df.head()
| title | subTitles | bidsCount | customerReviews | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT | 0.00 |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | False | False | $739.95 | 748.88 | xbx | console | 499.99 | MSFT | 0.00 |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | {'review': '', 'count': 1087, 'link': 'https:/... | False | False | $739.95 | 690.00 | xbx | console | 499.99 | MSFT | 11.99 |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | {'review': '', 'count': 1087, 'link': 'https:/... | False | False | $739.95 | 750.00 | xbx | console | 499.99 | MSFT | 0.00 |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | {'review': '', 'count': 1087, 'link': 'https:/... | False | False | $739.95 | 643.78 | xbx | console | 499.99 | MSFT | 65.94 |
Number of reviews was not extracted from the data prior to storing it as a CSV file. As a result, I needed to convert from the stored string to a dictionary structure, and from there extract the relevant value. Here, I made use of the AST libraries ability to evaluate a string as python code. This could introduce security vulnerabilities in the code in theory, but the risks in this case are fairly small.
reviewCount = []
for x in df.itertuples(): ## Extract the number of reviews a given listing has.
a = ast.literal_eval(x.customerReviews)
reviewCount.append(int(a['count']))
df['review_count'] = reviewCount
df = df.drop(columns=['customerReviews'])
df.head()
| title | subTitles | bidsCount | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | review_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT | 0.00 | 1087 |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | False | False | $739.95 | 748.88 | xbx | console | 499.99 | MSFT | 0.00 | 1087 |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | False | False | $739.95 | 690.00 | xbx | console | 499.99 | MSFT | 11.99 | 1087 |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | False | False | $739.95 | 750.00 | xbx | console | 499.99 | MSFT | 0.00 | 1087 |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | False | False | $739.95 | 643.78 | xbx | console | 499.99 | MSFT | 65.94 | 1087 |
For my first major transformations I calculated dollar markups for each sale as well as the ratio of sale price to MSRP in order to normalize the price increases to each other. This allows for the various products to be more readily compared or to be plotted on the same axes.
df['markup'] = df['sale_price'] - df['base_price'] ## Markup = sale price - base price
df['multiplier'] = df['sale_price'] / df['base_price'] ## Price Multiplier = sale price / base price
df.head()
| title | subTitles | bidsCount | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | review_count | markup | multiplier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | False | False | $739.95 | 729.99 | xbx | console | 499.99 | MSFT | 0.00 | 1087 | 230.00 | 1.460009 |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | False | False | $739.95 | 748.88 | xbx | console | 499.99 | MSFT | 0.00 | 1087 | 248.89 | 1.497790 |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | False | False | $739.95 | 690.00 | xbx | console | 499.99 | MSFT | 11.99 | 1087 | 190.01 | 1.380028 |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | False | False | $739.95 | 750.00 | xbx | console | 499.99 | MSFT | 0.00 | 1087 | 250.01 | 1.500030 |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | False | False | $739.95 | 643.78 | xbx | console | 499.99 | MSFT | 65.94 | 1087 | 143.79 | 1.287586 |
The first plot I chose to make was the base price of each sale in US Dollars vs. the Sale Price. This should show the degree to which sale price may depend on the MSRP of each product.
df.plot.scatter('base_price', 'sale_price', title='Base Price USD vs Sale Price USD', xlabel='MSRP in Dollars', ylabel = 'Sale Price in Dollars')
<AxesSubplot:title={'center':'Base Price USD vs Sale Price USD'}, xlabel='MSRP in Dollars', ylabel='Sale Price in Dollars'>
At this point it became apparent that some unusual data had been entered into the data set. Specifically, there were some items that appeared to have prices far below even the cheepest item for sale. Therefore, I decided to sort the data set to see if these had been properly included or could be explained out of the data set.
df.sort_values(by='sale_price').head()
| title | subTitles | bidsCount | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | review_count | markup | multiplier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2313 | Microsoft Xbox Series X 1TB SSD (Empty Box Only) | ['Open Box · Microsoft Xbox'] | 0 | False | False | $749.99 | 29.99 | xbx | console | 499.99 | MSFT | 21.0 | 0 | -470.0 | 0.059981 |
| 713 | Microsoft Xbox Series X 1TB SSD (Empty Box Only) | ['Open Box · Microsoft Xbox'] | 0 | False | False | $749.99 | 29.99 | xbx | console | 499.99 | MSFT | 21.0 | 0 | -470.0 | 0.059981 |
| 63 | Microsoft Xbox Series X 1TB SSD (Empty Box Only) | ['Open Box · Microsoft Xbox'] | 0 | False | False | $749.99 | 29.99 | xbx | console | 499.99 | MSFT | 21.0 | 0 | -470.0 | 0.059981 |
| 413 | Microsoft Xbox Series X 1TB SSD (Empty Box Only) | ['Open Box · Microsoft Xbox'] | 0 | False | False | $749.99 | 29.99 | xbx | console | 499.99 | MSFT | 21.0 | 0 | -470.0 | 0.059981 |
| 363 | Microsoft Xbox Series X 1TB SSD (Empty Box Only) | ['Open Box · Microsoft Xbox'] | 0 | False | False | $749.99 | 29.99 | xbx | console | 499.99 | MSFT | 21.0 | 0 | -470.0 | 0.059981 |
The five cheapest options, each sold for a price of $29.99 are labelled 'Empty Box Only,' indicating that they at least were outliers due to improper inclusion in the data set, hence they should be removed before further analysis is performed. The exact reason for their existence as sales within ebay is not relevant, but may indicate attempts by sellers to trick unsuspecting shoppers or automated scalping systems.
I took steps to remove each sale item labelled as an empty box, then checked the data set again to see if any other additional low end outliers were apparent.
not_empty_box = []
for t in df.itertuples(): ## Find all entries marked as empty boxes and mark them for removal.
not_empty_box.append('Empty Box' not in t.title)
df['not_empty_box'] = not_empty_box
df = df[df['not_empty_box']] ## Keep only entries that are not for empty boxes.
df.drop(inplace=True, columns=['not_empty_box'])
df.sort_values(by='sale_price').head()
| title | subTitles | bidsCount | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | review_count | markup | multiplier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | Microsoft Wireless Controller for Xbox Series ... | ['Brand New'] | 6 | False | False | $739.95 | 37.00 | xbx | console | 499.99 | MSFT | 22.83 | 53 | -462.99 | 0.074001 |
| 2018 | New ListingTwo Microsoft Wireless Controller f... | ['Brand New'] | 0 | False | False | $739.95 | 78.00 | xbx | console | 499.99 | MSFT | 33.09 | 0 | -421.99 | 0.156003 |
| 2631 | NEW GIGABYTE GeForce RTX 30 80 VISION OC 10GB ... | ['Brand New'] | 0 | False | False | $779.00 | 484.69 | rtx3080 | GPU | 699.00 | NVDA | 0.00 | 17 | -214.31 | 0.693405 |
| 2873 | Nuevo GIGABYTE GeForce RTX 30 80 Vision OC 10G... | ['Totalmente nuevo'] | 0 | False | False | $779.00 | 484.69 | rtx3080 | GPU | 699.00 | NVDA | 0.00 | 17 | -214.31 | 0.693405 |
| 3262 | NEW GIGABYTE GeForce RTX 30 80 VISION OC 10GB ... | ['Brand New'] | 0 | False | False | $779.00 | 484.69 | rtx3080 | GPU | 699.00 | NVDA | 0.00 | 17 | -214.31 | 0.693405 |
This left another two invalid inclusions revealed, with the new two lowest prices being wireless controllers for Microsoft's Xbox platform. Therefore another round of filtering wass necessary, with a largely identical process.
This was followed up with a further check.
not_controller = []
for t in df.itertuples(): ## Find all controller entries and mark them for removal.
not_controller.append('Controller' not in t.title)
df['not_controller'] = not_controller
df = df[df['not_controller']] ## Keep only entries that are not for controllers.
df.drop(inplace=True, columns=['not_controller'])
df.sort_values(by='sale_price').head()
| title | subTitles | bidsCount | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | review_count | markup | multiplier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2631 | NEW GIGABYTE GeForce RTX 30 80 VISION OC 10GB ... | ['Brand New'] | 0 | False | False | $779.00 | 484.69 | rtx3080 | GPU | 699.0 | NVDA | 0.0 | 17 | -214.31 | 0.693405 |
| 3262 | NEW GIGABYTE GeForce RTX 30 80 VISION OC 10GB ... | ['Brand New'] | 0 | False | False | $779.00 | 484.69 | rtx3080 | GPU | 699.0 | NVDA | 0.0 | 17 | -214.31 | 0.693405 |
| 2873 | Nuevo GIGABYTE GeForce RTX 30 80 Vision OC 10G... | ['Totalmente nuevo'] | 0 | False | False | $779.00 | 484.69 | rtx3080 | GPU | 699.0 | NVDA | 0.0 | 17 | -214.31 | 0.693405 |
| 3261 | * NEW * SEALED Asus Nvidia Geforce RTX 3080i R... | ['Brand New'] | 0 | False | False | $779.00 | 537.44 | rtx3080 | GPU | 699.0 | NVDA | 0.0 | 0 | -161.56 | 0.768870 |
| 2630 | * NEW * SEALED Asus Nvidia Geforce RTX 3080i R... | ['Brand New'] | 0 | False | False | $779.00 | 537.44 | rtx3080 | GPU | 699.0 | NVDA | 0.0 | 0 | -161.56 | 0.768870 |
While there were still unusually low sale prices listed for the RTX 3080, there was no apparent way to explain these away, so they had to remain in the data set.
I repeated the original plot to see if there was a pattern.
df.plot.scatter('base_price', 'sale_price', title='Base Price USD vs Sale Price USD', xlabel='MSRP in Dollars', ylabel = 'Sale Price in Dollars')
<AxesSubplot:title={'center':'Base Price USD vs Sale Price USD'}, xlabel='MSRP in Dollars', ylabel='Sale Price in Dollars'>
The scatter plot revealed a fairly strong positive correlation between MSRP and Price of sale.
df.plot.scatter('base_price', 'multiplier', title='Base Price USD vs Price Multiplier', xlabel='MSRP in Dollars', ylabel = 'Price Multiplier')
<AxesSubplot:title={'center':'Base Price USD vs Price Multiplier'}, xlabel='MSRP in Dollars', ylabel='Price Multiplier'>
Adjusting the markup to the multiplier value revealed no apparent correlation between MSRP and Price Multiplier, which suggests that a linear model may be sufficient for relating the ultimate sale price to its base value.
I continued with histograms of the prices of the products to examine for the approximate normality demanded for most simple statistical models.
df.hist(column='sale_price', by='model', figsize=(8,6))
array([[<AxesSubplot:title={'center':'ps5_dig'}>,
<AxesSubplot:title={'center':'rtx3070'}>],
[<AxesSubplot:title={'center':'rtx3080'}>,
<AxesSubplot:title={'center':'rx6800'}>],
[<AxesSubplot:title={'center':'rx6900'}>,
<AxesSubplot:title={'center':'xbx'}>]], dtype=object)
On the whole, the histograms for the sale price look somewhat normal, though they are not truly so. However, given how many members of the samples that have been taken, it should still be fine to use the assumption of normality in analyses. Since every item that exists within the data set has been plotted except for those removed due to improper inclusion, this chart revealed that at least two expected products did not make it into the data set: the disk version of the Playstation 5 and the Xbox Series S. Checking in the backup files for these revealed that no data on the PS5 can be found and the same holds for the Xbox Series S. I am unable to ascertain why this data is missing, but the only option available to me at this point wass to proceed anyway by removing analysis of them from the scope of the examination, though I will attempt to guess an approximate market price for them based on their similarities to the data that was successfully collected. However, additional research into the matter revealed that at least the XBox Series S has returned to availability directly from Microsoft which has drastically undercut the second-hand market's supply.
Comparison of the axes of thes histograms indicate that the true mean sale price of the different items may differ, so individual item identification will be included in the modelling step.
df.boxplot(column='markup', by='model')
<AxesSubplot:title={'center':'markup'}, xlabel='model'>
This boxplot shows a stronger implication than the previous histogram. It suggests that not only the sale prices differ between the items, but further that the average markups may as well.
I decided to see if this pattern held within the product types.
df.boxplot(column='multiplier', by='type')
<AxesSubplot:title={'center':'multiplier'}, xlabel='type'>
This suggests that the markup as a percentage of MSRP is higher for GPUs than it is for consoles. This pattern is in keeping with the popular hypothesis that demand for graphics cards may be driven in part by an ongoing cryptocurrency boom. However, this would be better evidenced by longitudinal pattern matching against a price index. This further suggests that the difference between a console and a GPU will be valuable for creation of a model. However, since there is no model of product that is both a GPU and a console, it is better to only use the individual models rather than include variables for whether an item is a GPU or a console, excepting the desire to make predictions about hardware that is not within the survey.
df.boxplot(column='multiplier', by='company')
<AxesSubplot:title={'center':'multiplier'}, xlabel='company'>
It appears that the differences between types also exist between companies, even within the same type of product, as evidenced by the apparent gap between Nvidia and AMD as well as between Sony and Microsoft. Therefore, a company dummy could be included in the final model, however, this would ultimately be a less precise way of comparing different models, since no model is made by two separate companies, so it will be excluded.
df.plot.scatter('sale_price', 'shipping', title='Sale Price vs. Shipping Price', xlabel='Sale Price in USD', ylabel='Shipping Price in USD')
<AxesSubplot:title={'center':'Sale Price vs. Shipping Price'}, xlabel='Sale Price in USD', ylabel='Shipping Price in USD'>
There seems to be a weak inverse correlation between shipping price and sales price. However, this may be susceptible to outliers, such as the two highest shipping prices, each exceeding \$350 with the greater exceeding \$500. This merits investigation.
df.sort_values(by='shipping', ascending=False).head()
| title | subTitles | bidsCount | topRatedSeller | sponsoredAd | trendingPrice | sale_price | model | type | base_price | company | shipping | review_count | markup | multiplier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2548 | 🔥Sony Playstation PS5 - DIGITAL EDITION BRAND ... | ['Brand New · Sony PlayStation 5 · Sony PlaySt... | 30 | False | False | $779.00 | 860.00 | ps5_dig | console | 399.99 | SONY | 555.62 | 306 | 460.01 | 2.150054 |
| 712 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 9 | False | False | $749.99 | 699.58 | xbx | console | 499.99 | MSFT | 391.31 | 1087 | 199.59 | 1.399188 |
| 2312 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 9 | False | False | $749.99 | 699.58 | xbx | console | 499.99 | MSFT | 391.31 | 1087 | 199.59 | 1.399188 |
| 362 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 9 | False | False | $749.99 | 699.58 | xbx | console | 499.99 | MSFT | 391.31 | 1087 | 199.59 | 1.399188 |
| 62 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 9 | False | False | $749.99 | 699.58 | xbx | console | 499.99 | MSFT | 391.31 | 1087 | 199.59 | 1.399188 |
While there appear to be several entries with extreme shipping costs each, and four of them appear to be identical, it is not clear that they are outliers due to error, and the shipping location of United States provides no further clues. We cannot eliminate these points, but their existence suggests that shipping price should not be overemphasized in the final model. However, this does suggest considering the sum of sale price and shipping cost, as that is the total price paid by the consumer.
df['sale+ship'] = df['sale_price'] + df['shipping']
df.hist(column='sale+ship', by='model', figsize=(8,6))
array([[<AxesSubplot:title={'center':'ps5_dig'}>,
<AxesSubplot:title={'center':'rtx3070'}>],
[<AxesSubplot:title={'center':'rtx3080'}>,
<AxesSubplot:title={'center':'rx6800'}>],
[<AxesSubplot:title={'center':'rx6900'}>,
<AxesSubplot:title={'center':'xbx'}>]], dtype=object)
The histograms of sales price plus shipping do not appear to differ drastically from the previous histograms. So, I decided not to include them in the final model due to its limited usefulness for prediction, especially since such a large majority of shipping was listed as "free".
Next, I chose to test bid count for correlation with the price multiplier for a given sold item, under the theory that an item that is heavily upon is likely to end up costing more.
df.plot.scatter('bidsCount', 'multiplier', title='Number of bids vs Price Multiplier')
<AxesSubplot:title={'center':'Number of bids vs Price Multiplier'}, xlabel='bidsCount', ylabel='multiplier'>
It is unclear whether the number of bids made on an item predicts how expensive it will become from the adjusted plot. It is worthwhile to investigate whether this may be clarified by controlling for the which item is being bid on.
for x in df['model'].unique(): ## Scatter bids count against price multiplier for each item type
ax = df[df['model'] == x].plot.scatter(x='bidsCount', y='multiplier', title='Number of Bids vs. Price Multiplier, {}'.format(x))
ind = df[df['model'] == x]['bidsCount']
dep = df[df['model'] == x]['multiplier']
m,b = np.polyfit(ind,dep,1) ## Create simple linear regression for assessing correlation
ax.plot(ind, m*ind + b, color='red')
display(ax)
<AxesSubplot:title={'center':'Number of Bids vs. Price Multiplier, xbx'}, xlabel='bidsCount', ylabel='multiplier'>
<AxesSubplot:title={'center':'Number of Bids vs. Price Multiplier, ps5_dig'}, xlabel='bidsCount', ylabel='multiplier'>
<AxesSubplot:title={'center':'Number of Bids vs. Price Multiplier, rtx3080'}, xlabel='bidsCount', ylabel='multiplier'>
<AxesSubplot:title={'center':'Number of Bids vs. Price Multiplier, rtx3070'}, xlabel='bidsCount', ylabel='multiplier'>
<AxesSubplot:title={'center':'Number of Bids vs. Price Multiplier, rx6800'}, xlabel='bidsCount', ylabel='multiplier'>
<AxesSubplot:title={'center':'Number of Bids vs. Price Multiplier, rx6900'}, xlabel='bidsCount', ylabel='multiplier'>
By breaking out these plots by individual models, it was revelaed that the relationship between number of bids and price multiplier is weak for most items and negative in many cases.This suggests that it may not be a good predicter for sale price. This is further suggested since most sales appear to have occured with no recorded bids. An alternative model that could explain this is that people are more likely to bid on something that is currently going for a cheap amount than something that is going for an expensive amount, but the cause of this is unclear. It could alternatively be related to the ability to make purchases outright in some listings rather than compete in bidding. As a result of this lack of clear correlation, the inability to predict number of bids for a given item not in the data set, and the predominance of no bid sales, I decided to exclude the number of bids from the final model.
Lastly, before creating the models and making predictions, I checked three seller characteristics to see if they should be included in the model: number of reviews, whether the seller is "top-rated", and whether the listing was a sponsored ad.
df.hist(column='multiplier', by='topRatedSeller', figsize=(6,3))
array([<AxesSubplot:title={'center':'False'}>,
<AxesSubplot:title={'center':'True'}>], dtype=object)
df.hist(column='multiplier', by='sponsoredAd', figsize=(6,3))
<AxesSubplot:title={'center':'False'}>
df.plot.scatter(x='review_count', y='multiplier')
<AxesSubplot:xlabel='review_count', ylabel='multiplier'>
From these three graphs it appeared that there were too few 'Top Rated Sellers' among our sample to be of use, there are no sponsored ad listings, and the number of reviews seems to be correlated with a slightly lower price, which appears largely driven by a small number of listings with a high number of reviews, so none of these three variables will be included in the model.
df = df.drop(columns=['topRatedSeller','sponsoredAd','review_count'])
df.head()
| title | subTitles | bidsCount | trendingPrice | sale_price | model | type | base_price | company | shipping | markup | multiplier | sale+ship | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | $739.95 | 729.99 | xbx | console | 499.99 | MSFT | 0.00 | 230.00 | 1.460009 | 729.99 |
| 1 | New ListingMicrosoft Xbox Series X 1TB Video G... | ['Brand New · Microsoft Xbox Series X'] | 0 | $739.95 | 748.88 | xbx | console | 499.99 | MSFT | 0.00 | 248.89 | 1.497790 | 748.88 |
| 2 | Microsoft Xbox Series X 1TB Console - Black *I... | ['Brand New · Microsoft Xbox Series X'] | 10 | $739.95 | 690.00 | xbx | console | 499.99 | MSFT | 11.99 | 190.01 | 1.380028 | 701.99 |
| 3 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New · Microsoft Xbox · Microsoft Xbox ... | 0 | $739.95 | 750.00 | xbx | console | 499.99 | MSFT | 0.00 | 250.01 | 1.500030 | 750.00 |
| 4 | Microsoft Xbox Series X 1TB Video Game Console... | ['Brand New'] | 21 | $739.95 | 643.78 | xbx | console | 499.99 | MSFT | 65.94 | 143.79 | 1.287586 | 709.72 |
I decided to create two models. The first one is for analyzing the current prices of each object in the dataset to attempt to model their value. This includes dummies for the individual items. This should result in the average sale price of each of the items for which we have data, and would allow for assessing the accuracy of ebay's 'trending price' metric.
The second model is for attempted extrapolation of desired items not contained in the dataset. This should use the companies selling each item, its base price, and whether it is a GPU times the base price as predictors since the price multiplier on sale appears greater for GPUs than for consoles. This should allow for the estimation of likely fair market prices of additional consoles and GPUs, such as the absent Playstation 5 with disk and Xbox Series S as well as the RTX 30606Ti or RX 6700XT.
m1_df = pd.get_dummies(df, prefix='item', columns=['model'], drop_first=True)
m1_df.drop(inplace=True, columns=['title', 'subTitles', 'bidsCount', 'trendingPrice', 'company', 'shipping', 'markup', 'multiplier', 'sale+ship', 'type', 'base_price'])
m1_df.head()
| sale_price | item_rtx3070 | item_rtx3080 | item_rx6800 | item_rx6900 | item_xbx | |
|---|---|---|---|---|---|---|
| 0 | 729.99 | 0 | 0 | 0 | 0 | 1 |
| 1 | 748.88 | 0 | 0 | 0 | 0 | 1 |
| 2 | 690.00 | 0 | 0 | 0 | 0 | 1 |
| 3 | 750.00 | 0 | 0 | 0 | 0 | 1 |
| 4 | 643.78 | 0 | 0 | 0 | 0 | 1 |
From the columns, I was able to determine that the missing product was the digital Playstation 5. This is necessary for going back and using the model after it is created.
Y = m1_df['sale_price'] ## Target values are sale price
X = m1_df.drop(columns =['sale_price']) ## Independent variables are the set of item dummies
model1 = linear_model.LinearRegression()
est = model1.fit(X,Y) ## Fit the model to these values
t = 'Sale_price = {}'.format(est.intercept_) ## Print out the linear regression model
i = 0
for x in X.columns:
if x == 'sale_price':
continue
t = t + ' + {}*{}'.format(est.coef_[i], x)
i = i + 1
print(t)
Sale_price = 809.6049999999866 + 816.1320595744735*item_rtx3070 + 1469.0079663213182*item_rtx3080 + 795.4475333333407*item_rx6800 + 1157.5986500000051*item_rx6900 + -63.43359336099297*item_xbx
## Print out the estimated average price for each item represented in the data set
y_bar = est.predict([[0,0,0,0,0]])[0]
print('PS5 (Digital): ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/400))
y_bar = est.predict([[0,0,0,0,1]])[0]
print('Xbox Series X: ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/500))
y_bar = est.predict([[1,0,0,0,0]])[0]
print('RTX 3070: ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/500))
y_bar = est.predict([[0,0,1,0,0]])[0]
print('RX 6800: ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/579))
y_bar = est.predict([[0,1,0,0,0]])[0]
print('RTX 3080: ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/700))
y_bar = est.predict([[0,0,0,1,0]])[0]
print('RX 6900XT: ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/1000))
PS5 (Digital): $809.60, Multiplier: 2.024 Xbox Series X: $746.17, Multiplier: 1.492 RTX 3070: $1625.74, Multiplier: 3.251 RX 6800: $1605.05, Multiplier: 2.772 RTX 3080: $2278.61, Multiplier: 3.255 RX 6900XT: $1967.20, Multiplier: 1.967
| Item | MSRP | Avg. Price | Trending Price | Multiplier | ||
|---|---|---|---|---|---|---|
| PS5 Digital | \$400 | \$809.60 | \$780.00 | 2.024 | ||
| Xbox Series X | \$500 | \$746.17 | \$739.95 | 1.492 | ||
| RTX 3070 | \$500 | \$1625.74 | \$2386.13 | 3.251 | ||
| RTX 3080 | \$700 | \$2278.61 | Not Available | 3.255 | ||
| RX 6800 | \$580 | \$1605.05 | \$1399.00 | 2.772 | ||
| RX 6900XT | \$1000 | \$1967.20 | Not available | 1.967 |
It appears that there is a preference among consumers for Nvidia and Sony products over the Microsoft and AMD products, as sale prices of cheaper items from Nvidia and Sony exceed the sale prices of Microsoft and Nvidia's offerings. This suggests that in turn that consumer who desire new hardware will likely not have to pay as much extra to buy into the Xbox ecosystem or an AMD graphics card as opposed to their competitors.
In order to evaluate the worth of the model, it is a good idea to determine how much of the variance in sale prices is explained by the variance of inputs in the model. We will use the r^2 metric.
print('Estimated r^2: {}'.format(est.score(X,Y)))
Estimated r^2: 0.7561123797913689
The r-squared value given is 0.7561, which indicates that approximately 75.61% of the variation in sales prices is explained by the selection of item. This is a very good value, indicating that the individual items are not all priced the same.
Now it is time to proceed to the predictive model.
df_temp = df.drop(columns=['title', 'subTitles', 'bidsCount', 'trendingPrice', 'shipping', 'markup', 'multiplier', 'sale+ship', 'model'])
m2_df = pd.get_dummies(df_temp, prefix='is', columns=['type'], drop_first=True)
m2_df = pd.get_dummies(m2_df, prefix = 'is', columns=['company'], drop_first=True)
m2_df.head()
| sale_price | base_price | is_console | is_MSFT | is_NVDA | is_SONY | |
|---|---|---|---|---|---|---|
| 0 | 729.99 | 499.99 | 1 | 1 | 0 | 0 |
| 1 | 748.88 | 499.99 | 1 | 1 | 0 | 0 |
| 2 | 690.00 | 499.99 | 1 | 1 | 0 | 0 |
| 3 | 750.00 | 499.99 | 1 | 1 | 0 | 0 |
| 4 | 643.78 | 499.99 | 1 | 1 | 0 | 0 |
m2_df['is_console*base_price'] = m2_df['base_price'] * m2_df['is_console'] ## Create an interaction term
m2_df.head()
| sale_price | base_price | is_console | is_MSFT | is_NVDA | is_SONY | is_console*base_price | |
|---|---|---|---|---|---|---|---|
| 0 | 729.99 | 499.99 | 1 | 1 | 0 | 0 | 499.99 |
| 1 | 748.88 | 499.99 | 1 | 1 | 0 | 0 | 499.99 |
| 2 | 690.00 | 499.99 | 1 | 1 | 0 | 0 | 499.99 |
| 3 | 750.00 | 499.99 | 1 | 1 | 0 | 0 | 499.99 |
| 4 | 643.78 | 499.99 | 1 | 1 | 0 | 0 | 499.99 |
X = m2_df.drop(columns=['sale_price']) ## Predictive values are everything left but sale price
Y = m2_df['sale_price'] ## Target values are sale price
model2 = linear_model.LinearRegression()
est2 = model2.fit(X,Y) ## Fit the model to the dataset
t = 'Sale_price = {}'.format(est2.intercept_) ## Print out the model.
i = 0
for x in X.columns:
if x == 'sale_price':
continue
t = t + ' + {}*{}'.format(est2.coef_[i], x)
i = i + 1
print(t)
Sale_price = -88.46285393350286 + 2.3176610477017254*base_price + 27.4238501094635*is_console + -109.69749613840546*is_MSFT + 604.5074652606584*is_NVDA + 137.12134624785847*is_SONY + -0.48380855744902007*is_console*base_price
print('Estimated r^2: {}'.format(est2.score(X,Y)))
Estimated r^2: 0.7341917515132947
The second model has a marginally lower r^2 score of 0.7342, indicating that it can only account for approximately 73.42% or the variation in the output with variation in the input. However, this model is better suited to the task of predictive extrapolation. I proceeded to use it to make predictions about items not found in the data set. This was extrapolation and an estimate, so it should not be considered except for illustrative purposes.
## Make price predictions for items not in the data set but related to those that are.
y_bar = est2.predict([[500,1,0,0,1,500]])[0]
print('PS5 Disk (est.): ${:.2f}'.format(y_bar) + ', Multiplier (est.): {:.3f}'.format(y_bar/500))
y_bar = est2.predict([[299.99,1,1,0,0,299.99]])[0]
print('Xbox Series S (est.): ${:.2f}'.format(y_bar) + ', Multiplier (est.): {:.3f}'.format(y_bar/300))
y_bar = est2.predict([[400,0,0,1,0,0]])[0]
print('RTX 3060Ti (est.): ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/400))
y_bar = est2.predict([[479,0,0,0,0,0]])[0]
print('RX 6700XT (est.): ${:.2f}'.format(y_bar) + ', Multiplier: {:.3f}'.format(y_bar/479))
PS5 Disk (est.): $993.01, Multiplier (est.): 1.986 Xbox Series S (est.): $379.40, Multiplier (est.): 1.265 RTX 3060Ti (est.): $1443.11, Multiplier: 3.608 RX 6700XT (est.): $1021.70, Multiplier: 2.133
| Item | MSRP | Avg. Price | Multiplier |
|---|---|---|---|
| PS5 Digital | \$400 | \$809.60 | 2.204 |
| Xbox Series X | \$500 | \$746.17 | 1.492 |
| RTX 3070 | \$500 | \$1625.74 | 3.251 |
| RTX 3080 | \$700 | \$2278.61 | 3.255 |
| RX 6800 | \$580 | \$1605.05 | 2.772 |
| RX 6900XT | \$1000 | \$1967.20 | 1.967 |
| ------- | ------- | ------- | ----- |
| PS5 Disk (est.) | \$500 | \$993.01 | 1.986 |
| Xbox Series S (est.) | \$300 | \$379.40 | 1.265 |
| RTX 3060Ti (est.) | \$400 | \$1443.11 | 3.608 |
| RX 6700XT (est.) | \$479 | \$1021.70 | 2.133 |
From the models, it appears that the lowest multiplier expected would be attained by purchasing the Xbox Series S. It also has the benefit of starting at the cheapest price, so the additional cash required for the purchase should also be smaller in raw dollar terms. This lines up well with outside findings, though it understates them. During the project, it came to my attention that Microsoft had resumed taking orders for the Xbox Series S on its website, so it is actually available at MSRP if one returns to first party sellers, though shipping may take longer than the order processing. By contrast, the worst multiplier and price markup is found on the Nvidia RTX 3080, as its price multiplier is the greatest out of any item (save the predicted muliplier on the RTX 3060Ti) and in raw dollars, the average markup is greater than \$1500. On the whole, a purchase of a console is likely the better choice for an individual looking to spend less at the moment, so long as the only use case would be for playing videogames. If someone was interested in work applications, then the specific needs would override cost-saving decisions by necessity.
In summary, in order to save money, even on the second hand market, a discerning consumer would be better served by a console than a PC. The overall winner is the Xbox platform, especially the Series S, and the PC winner is an AMD GPU. Note that this PC recommendation only applies to AMD GPUs as the most recent AMD CPU has also been subject to scarcity and highly raised prices.
Due to the requirements for automation for data collection with ebay, I was not able to apply proper randomization to sampling. This couples with the lack of time data and expected data which simply did not make it into the data set to limit the reliability of the results. The conclusions were based largely on extrapolations as a result of the missing data and the limited amount of data I could collect. In order to strengthen the results under replication of this analysis, I would recommend using a different collection API than ProxyCrawl. Use of a proxy was a necessity to automate data collection on ebay, however, there were three primary drawbacks that I did not anticipate while selecting the service. First, the lack of sale date in the data forced me to cut out any longitudinal analysis of prices, which is likely a substantial component of understanding the data. Second, the proxy appears to have intermittently switched to a different country, which led to unusable pages of data, since the resulting text was not identifiable by me and interpreting the sale price would have required implementing an arbitrary currency converter. Third, there seem to have been stability issues which resulted in data for the Playstion 5 disk version and the Xbox Series S never making it into the data set or any partial backup despite being properly targeted by the scrape.